"
]
},
"metadata": {}
}
]
},
{
"cell_type": "code",
"source": [
"def plot_dataset(X, y):\n",
"    plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"bs\")\n",
"    plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"g^\")\n",
"    #plt.axis(axes)\n",
"    plt.grid(True, which='both')\n",
"    plt.xlabel(\"$x_1$\")\n",
"    plt.ylabel(\"$x_2$\", rotation=0)\n",
"\n",
"def plot_decision_boundary(clf, X, y, alpha=1.0):\n",
"    axes = [-1.5, 2.4, -1, 1.5]\n",
"    x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 100),\n",
"                         np.linspace(axes[2], axes[3], 100))\n",
"    X_new = np.c_[x1.ravel(), x2.ravel()]\n",
"    y_pred = clf.predict(X_new).reshape(x1.shape)\n",
"\n",
"    plt.contourf(x1, x2, y_pred, alpha=0.3 * alpha, cmap='Wistia')\n",
"    plt.contour(x1, x2, y_pred, cmap=\"Greys\", alpha=0.8 * alpha)\n",
"    colors = [\"#78785c\", \"#c47b27\"]\n",
"    markers = (\"o\", \"^\")\n",
"    for idx in (0, 1):\n",
"        plt.plot(X[:, 0][y == idx], X[:, 1][y == idx],\n",
"                 color=colors[idx], marker=markers[idx], linestyle=\"none\")\n",
"    plt.axis(axes)\n",
"    plt.xlabel(r\"$x_1$\")\n",
"    plt.ylabel(r\"$x_2$\", rotation=0)"
],
"metadata": {
"id": "ruWWKAR_Mz5P"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Voting classifier"
],
"metadata": {
"id": "LTrx75jqLcLg"
}
},
{
"cell_type": "markdown",
"source": [
"`Scikit-Learn` provides a `VotingClassifier` class that’s quite easy to use: just give it a list of name/predictor pairs and use it like a normal classifier. Let’s try it on the moons dataset (a toy dataset for binary classification in which the data points are shaped as two interleaving crescent moons). We will load the moons dataset and split it into a training set and a test set, then create and train a voting classifier composed of four diverse classifiers:"
],
"metadata": {
"id": "PvL__u5fL5Sw"
}
},
{
"cell_type": "code",
"source": [
"X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
"plot_dataset(X, y)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 281
},
"id": "-AB6jsQHLfpk",
"outputId": "f4b4489f-8b3b-4f2e-942a-49641835aa4a"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Here we use [`SciKeras`](https://github.com/adriangb/scikeras) to wrap a Keras model so that it can be used as a `Scikit-Learn` estimator."
],
"metadata": {
"id": "qmJ0_x9ezBAO"
}
},
{
"cell_type": "code",
"source": [
"def get_model():\n",
"    model = keras.models.Sequential([\n",
"        keras.layers.Dense(30, activation='relu', input_shape=[2]),\n",
"        keras.layers.Dense(20, activation='relu'),\n",
"        keras.layers.Dense(1, activation='sigmoid')\n",
"    ])\n",
"    model.compile(optimizer='NAdam', loss='binary_crossentropy', metrics=['accuracy'])\n",
"    return model"
],
"metadata": {
"id": "DSARLteqwsY-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"dense_model = KerasClassifier(model=get_model, epochs=200, verbose=False)"
],
"metadata": {
"id": "mufRkMoGxOuu"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"voting_clf = VotingClassifier(\n",
" estimators=[\n",
" ('lr', LogisticRegression(random_state=42)),\n",
" ('rf', RandomForestClassifier(random_state=42)),\n",
" ('svc', SVC(random_state=42)),\n",
" ('dense', dense_model)\n",
" ]\n",
")\n",
"voting_clf.fit(X_train, y_train)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "z2-c2OA5M4aG",
"outputId": "133a4343-5ec0-45d7-e7f3-7b8330d9a83d"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),\n",
" ('rf', RandomForestClassifier(random_state=42)),\n",
" ('svc', SVC(random_state=42)),\n",
" ('dense',\n",
" KerasClassifier(epochs=200, model=, verbose=False))])"
]
},
"metadata": {},
"execution_count": 109
}
]
},
{
"cell_type": "markdown",
"source": [
"When you fit a `VotingClassifier`, it clones every estimator and fits the clones. The original estimators are available via the `estimators` attribute, while the fitted clones are available via the `estimators_` attribute. If you prefer a dict rather than a list, you can use `named_estimators` or `named_estimators_` instead. For example, let’s look at each fitted classifier’s accuracy on the test set:"
],
"metadata": {
"id": "HMfYwrvVNeH3"
}
},
{
"cell_type": "code",
"source": [
"for name, clf in voting_clf.named_estimators_.items():\n",
" print(name, \"=\", clf.score(X_test, y_test))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1Y1o7PFQNIPg",
"outputId": "7853f87b-55f2-4ab1-8482-33fbef160914"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"lr = 0.864\n",
"rf = 0.896\n",
"svc = 0.896\n",
"dense = 0.896\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"When you call the voting classifier’s `predict()` method, it performs hard voting. For example, the voting classifier predicts class 1 for the first instance of the test set, because 3 out of 4 classifiers predict that class:"
],
"metadata": {
"id": "pcfc8mtjNywf"
}
},
{
"cell_type": "code",
"source": [
"voting_clf.predict(X_test[:1]), [clf.predict(X_test[:1]) for clf in voting_clf.estimators_]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2TrG5WLoNvGl",
"outputId": "26dedde2-204a-469f-d971-b87ad8dc555d"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(array([1]), [array([1]), array([1]), array([0]), array([1])])"
]
},
"metadata": {},
"execution_count": 111
}
]
},
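{
"cell_type": "markdown",
"source": [
"As a rough illustration of hard voting (a minimal sketch, not `Scikit-Learn`’s actual implementation), the four predictions shown above can be tallied with `np.bincount`. The helper name `hard_vote` is hypothetical:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def hard_vote(predictions):\n",
"    # predictions: one non-negative integer class label per classifier\n",
"    counts = np.bincount(np.asarray(predictions))\n",
"    # argmax picks the most frequent label (the lowest label wins ties)\n",
"    return int(np.argmax(counts))\n",
"\n",
"# Votes of lr, rf, svc and dense for the first test instance\n",
"print(hard_vote([1, 1, 0, 1]))  # prints 1\n",
"```\n",
"\n",
"With an even number of voters, ties are possible; this sketch resolves them in favor of the lowest class label."
],
"metadata": {}
},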
{
"cell_type": "markdown",
"source": [
"Now let’s look at the performance of the voting classifier on the test set:"
],
"metadata": {
"id": "FfwmQiZiN-7P"
}
},
{
"cell_type": "code",
"source": [
"voting_clf.score(X_test, y_test)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "R-4gZYXnN3Q0",
"outputId": "c64c50e5-7c31-4e8b-d8c3-6ca1e6fb6021"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.88"
]
},
"metadata": {},
"execution_count": 112
}
]
},
{
"cell_type": "code",
"source": [
"plot_decision_boundary(voting_clf, X_train, y_train)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 285
},
"id": "vQznlFPfU6aY",
"outputId": "0fe2bafe-b182-4970-83fc-4ed8decdaf39"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"If all classifiers are able to estimate class probabilities (i.e., they all have a `predict_proba()` method), then you can tell `Scikit-Learn` to predict the class with the **highest class probability, averaged over all the individual classifiers.** This is called **soft voting**. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is set the voting classifier’s `voting` hyperparameter to `\"soft\"`, and ensure that all classifiers can estimate class probabilities. \n",
"\n",
"This is not the case for the `SVC` class by default, so you need to set its `probability` hyperparameter to `True` (this will make the `SVC` class use cross-validation to estimate class probabilities, slowing down training, and it will add a `predict_proba()` method). Let’s try that:"
],
"metadata": {
"id": "0OCmMXDAOEsI"
}
},
{
"cell_type": "code",
"source": [
"voting_clf.voting = \"soft\"\n",
"voting_clf.named_estimators[\"svc\"].probability = True\n",
"voting_clf.fit(X_train, y_train)\n",
"voting_clf.score(X_test, y_test)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tex4mcrIOBQ2",
"outputId": "5dde7cd6-8849-41c6-e165-5a21e7e2aa00"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.912"
]
},
"metadata": {},
"execution_count": 114
}
]
},
{
"cell_type": "markdown",
"source": [
"We reach 91.2% accuracy simply by using soft voting, not bad!"
],
"metadata": {
"id": "I__SqQ0lOdco"
}
},
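{
"cell_type": "markdown",
"source": [
"To see how soft voting can disagree with hard voting, here is a minimal NumPy sketch. The probability values are made up for illustration and do not come from the classifiers trained above:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical [P(class 0), P(class 1)] estimates from three classifiers\n",
"probas = np.array([\n",
"    [0.45, 0.55],\n",
"    [0.40, 0.60],\n",
"    [0.95, 0.05],\n",
"])\n",
"\n",
"# Hard voting: each classifier votes for its most likely class\n",
"hard_votes = probas.argmax(axis=1)        # votes are 1, 1, 0\n",
"hard_winner = np.bincount(hard_votes).argmax()\n",
"\n",
"# Soft voting: average the probabilities, then pick the most likely class\n",
"mean_proba = probas.mean(axis=0)          # about [0.6, 0.4]\n",
"soft_winner = mean_proba.argmax()\n",
"\n",
"print(hard_winner, soft_winner)  # prints 1 0\n",
"```\n",
"\n",
"Two lukewarm votes for class 1 win the hard vote, but the single very confident vote for class 0 dominates the averaged probabilities."
],
"metadata": {}
},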
{
"cell_type": "markdown",
"source": [
"For stacking neural network models, you can refer to https://ensemble-pytorch.readthedocs.io/en/latest/ for more details."
],
"metadata": {
"id": "R7wMvsUO8t9V"
}
},
{
"cell_type": "markdown",
"source": [
"## Stacking"
],
"metadata": {
"id": "PBHcV002qZ1u"
}
},
{
"cell_type": "markdown",
"source": [
"`Scikit-Learn` provides two classes for stacking ensembles: `StackingClassifier` and `StackingRegressor`. For example, you can replace the `VotingClassifier` you used on the moons dataset with a `StackingClassifier`:"
],
"metadata": {
"id": "DP5DoUCdqbaN"
}
},
{
"cell_type": "code",
"source": [
"stacking_clf = StackingClassifier(\n",
" estimators=[\n",
" ('lr', LogisticRegression(random_state=42)),\n",
" ('rf', RandomForestClassifier(random_state=42)),\n",
" ('svc', SVC(probability=True, random_state=42))\n",
" ],\n",
" final_estimator=RandomForestClassifier(random_state=43),\n",
" cv=5 # number of cross-validation folds\n",
")\n",
"stacking_clf.fit(X_train, y_train)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1rqpE6FZqlN1",
"outputId": "6de014a5-d7c7-4bac-d60b-6c42c6644494"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"StackingClassifier(cv=5,\n",
" estimators=[('lr', LogisticRegression(random_state=42)),\n",
" ('rf', RandomForestClassifier(random_state=42)),\n",
" ('svc', SVC(probability=True, random_state=42))],\n",
" final_estimator=RandomForestClassifier(random_state=43))"
]
},
"metadata": {},
"execution_count": 34
}
]
},
{
"cell_type": "markdown",
"source": [
"For each predictor, the stacking classifier will call `predict_proba()` if available, or it will fall back to `decision_function()` if available, or as a last resort it will call `predict()`. If you don’t provide a final estimator, `StackingClassifier` will use `LogisticRegression`, and `StackingRegressor` will use `RidgeCV`."
],
"metadata": {
"id": "08TA3D47qsWN"
}
},
{
"cell_type": "code",
"source": [
"stacking_clf.score(X_test, y_test)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7ms6ZeGCq4_W",
"outputId": "6147afb3-260d-491c-d36c-dbe82cd0c620"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.928"
]
},
"metadata": {},
"execution_count": 35
}
]
},
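{
"cell_type": "markdown",
"source": [
"The idea behind stacking can be sketched by hand. This is only an illustration under simplifying assumptions (two base models, class-1 probabilities as the only blender features), not a reimplementation of `StackingClassifier`; the names `base_models`, `meta_train`, `meta_test` and `blender` are hypothetical:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_moons\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_predict, train_test_split\n",
"\n",
"X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)\n",
"\n",
"base_models = [LogisticRegression(random_state=42),\n",
"               RandomForestClassifier(random_state=42)]\n",
"\n",
"# Out-of-fold class-1 probabilities become the blender's training features\n",
"meta_train = np.column_stack([\n",
"    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]\n",
"    for m in base_models])\n",
"\n",
"blender = LogisticRegression()\n",
"blender.fit(meta_train, y_tr)\n",
"\n",
"# At prediction time, each base model is refit on the full training set\n",
"meta_test = np.column_stack([\n",
"    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models])\n",
"print(blender.score(meta_test, y_te))\n",
"```\n",
"\n",
"Using out-of-fold predictions keeps the blender from being trained on predictions that the base models made for instances they had already seen."
],
"metadata": {}
},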
{
"cell_type": "markdown",
"source": [
"You get 92.8% accuracy, which is a bit better than the voting classifier using soft voting, which got 91.2%."
],
"metadata": {
"id": "pIeOuOsOq_N-"
}
},
{
"cell_type": "markdown",
"source": [
"## Bagging and Pasting"
],
"metadata": {
"id": "I97bOx8CTlf3"
}
},
{
"cell_type": "markdown",
"source": [
"`Scikit-Learn` offers a simple API for both bagging and pasting with the `BaggingClassifier` class (or `BaggingRegressor` for regression). The following code trains an ensemble of 500 **Decision Tree classifiers**: each is trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set `bootstrap=False`). You can also set the `n_jobs` parameter to tell `Scikit-Learn` how many CPU cores to use for training and predictions; –1 means all available cores."
],
"metadata": {
"id": "QzERUKToTov_"
}
},
{
"cell_type": "code",
"source": [
"bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,\n",
" max_samples=100, random_state=42)\n",
"bag_clf.fit(X_train, y_train)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "x43B6MDAObRX",
"outputId": "5e4c6379-1a36-497d-a99f-f949e3e1c8fa"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,\n",
" n_estimators=500, random_state=42)"
]
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "markdown",
"source": [
"Notice that the `BaggingClassifier` automatically performs **soft voting** instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a `predict_proba()` method), which is the case with Decision Tree classifiers."
],
"metadata": {
"id": "BF7-HM3KT99Q"
}
},
{
"cell_type": "markdown",
"source": [
"The following code compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees (from the preceding code), both trained on the moons dataset. As you can see, the ensemble’s predictions will likely generalize much better than the single Decision Tree’s predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular)."
],
"metadata": {
"id": "4lhj-r7nUMn3"
}
},
{
"cell_type": "code",
"source": [
"tree_clf = DecisionTreeClassifier(random_state=42)\n",
"tree_clf.fit(X_train, y_train)\n",
"\n",
"fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)\n",
"plt.sca(axes[0])\n",
"plot_decision_boundary(tree_clf, X_train, y_train)\n",
"plt.title(\"Decision Tree\")\n",
"plt.sca(axes[1])\n",
"plot_decision_boundary(bag_clf, X_train, y_train)\n",
"plt.title(\"Decision Trees with Bagging\")\n",
"plt.ylabel(\"\")\n",
"plt.show()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
},
"id": "8wOuoYCmUMG2",
"outputId": "cb3cf50f-611e-477e-cf95-6c10cd6c1df5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Out-of-Bag evaluation"
],
"metadata": {
"id": "9HiUsZa_VGBw"
}
},
{
"cell_type": "markdown",
"source": [
"It can be shown mathematically that, when each predictor draws `m` instances with replacement from a training set of size `m`, only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors. The following code computes the sampled fraction for `m=1000` and compares it with the limit as `m` grows:"
],
"metadata": {
"id": "LXRxTOnAa2Ev"
}
},
{
"cell_type": "code",
"source": [
"print(1 - (1 - 1 / 1000) ** 1000)\n",
"print(1 - np.exp(-1))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "t_pOXuBaVRwI",
"outputId": "1104123d-7f8b-4160-f868-f6b17a2ee36c"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"0.6323045752290363\n",
"0.6321205588285577\n"
]
}
]
},
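{
"cell_type": "markdown",
"source": [
"The same fraction can be checked empirically. The sketch below is an illustrative simulation (the variable names are arbitrary): it draws one bootstrap sample of size `m` and counts how many distinct instances it contains:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(42)\n",
"m = 1000\n",
"\n",
"# Draw m instance indices with replacement, as bagging does for one predictor\n",
"sample = rng.integers(0, m, size=m)\n",
"\n",
"# Fraction of distinct instances that ended up in the sample\n",
"in_bag_fraction = np.unique(sample).size / m\n",
"oob_fraction = 1 - in_bag_fraction\n",
"\n",
"print(in_bag_fraction, oob_fraction)\n",
"```\n",
"\n",
"The exact values vary with the random seed, but the in-bag fraction should stay close to the 63% computed above."
],
"metadata": {}
},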
{
"cell_type": "markdown",
"source": [
"In `Scikit-Learn`, you can set `oob_score=True` when creating a `BaggingClassifier` to request an automatic oob evaluation after training. The following code demonstrates this. The resulting evaluation score is available in the `oob_score_` attribute:"
],
"metadata": {
"id": "aMg1xzaMVHfr"
}
},
{
"cell_type": "code",
"source": [
"bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,\n",
" oob_score=True, n_jobs=-1, random_state=42)\n",
"bag_clf.fit(X_train, y_train)\n",
"bag_clf.oob_score_"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MpOmKptvT65-",
"outputId": "418d2dc7-8f3e-4379-cf07-be3a423772cb"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.896"
]
},
"metadata": {},
"execution_count": 22
}
]
},
{
"cell_type": "markdown",
"source": [
"According to this oob evaluation, this `BaggingClassifier` is likely to achieve about 89.6% accuracy on the test set. Let’s verify this:"
],
"metadata": {
"id": "qjCSrZaaakSn"
}
},
{
"cell_type": "code",
"source": [
"y_pred = bag_clf.predict(X_test)\n",
"accuracy_score(y_test, y_pred)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "C5GIYPd1an13",
"outputId": "65ebf350-b3f7-4885-cd6b-3ef10aa70a76"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.92"
]
},
"metadata": {},
"execution_count": 27
}
]
},
{
"cell_type": "markdown",
"source": [
"We get 92% accuracy on the test set. The oob evaluation was a bit too pessimistic, about 2.4 percentage points too low."
],
"metadata": {
"id": "OmiahqHWaqqo"
}
},
{
"cell_type": "markdown",
"source": [
"The `BaggingClassifier` class supports **sampling the features** as well. Sampling is controlled by two hyperparameters: `max_features` and `bootstrap_features`. They work the same way as `max_samples` and `bootstrap`, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.\n",
"\n",
"This technique is particularly useful when you are dealing with high-dimensional inputs (such as images). **Sampling both training instances and features is called the Random Patches method**. Keeping all training instances (by setting `bootstrap=False` and `max_samples=1.0`) but sampling features (by setting `bootstrap_features` to True and/or `max_features` to a value smaller than 1.0) is called the **Random Subspaces method**."
],
"metadata": {
"id": "rva-jGH6bOmw"
}
},
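{
"cell_type": "markdown",
"source": [
"As a hedged sketch of how these two methods could be configured, consider the following. The fractions 0.7 and 0.5, the number of estimators and the synthetic 20-feature dataset are arbitrary choices made for illustration:\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import BaggingClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)\n",
"\n",
"# Random Patches: sample both training instances and features\n",
"patches_clf = BaggingClassifier(\n",
"    DecisionTreeClassifier(), n_estimators=50,\n",
"    max_samples=0.7, bootstrap=True,\n",
"    max_features=0.5, bootstrap_features=True,\n",
"    random_state=42)\n",
"\n",
"# Random Subspaces: keep all training instances, sample only the features\n",
"subspaces_clf = BaggingClassifier(\n",
"    DecisionTreeClassifier(), n_estimators=50,\n",
"    max_samples=1.0, bootstrap=False,\n",
"    max_features=0.5, bootstrap_features=False,\n",
"    random_state=42)\n",
"\n",
"for clf in (patches_clf, subspaces_clf):\n",
"    clf.fit(X_demo, y_demo)\n",
"    # estimators_features_ holds the feature indices given to each tree\n",
"    print(len(clf.estimators_features_[0]))\n",
"```\n",
"\n",
"On a two-feature dataset such as moons, feature sampling leaves little to choose from, which is why a wider synthetic dataset is used here."
],
"metadata": {}
},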
{
"cell_type": "markdown",
"source": [
"### Random forest"
],
"metadata": {
"id": "9oWyFDYpdIwf"
}
},
{
"cell_type": "markdown",
"source": [
"Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), **typically with `max_samples` set to the size of the training set.** Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, you can use the `RandomForestClassifier` class, which is more convenient and **optimized for Decision Trees** (similarly, there is a `RandomForestRegressor` class for regression tasks). The following `BaggingClassifier` is roughly equivalent to the `RandomForestClassifier` trained below:\n",
"\n",
"```python\n",
"bag_clf = BaggingClassifier(\n",
" DecisionTreeClassifier(max_features=\"sqrt\", max_leaf_nodes=16),\n",
" n_estimators=500, n_jobs=-1, random_state=42)\n",
"```\n",
"\n",
"The following code trains a Random Forest classifier with 500 trees, each limited to a maximum of 16 leaf nodes, and using all available CPU cores:"
],
"metadata": {
"id": "j_TkqrLMdMT4"
}
},
{
"cell_type": "code",
"source": [
"rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)\n",
"rnd_clf.fit(X_train, y_train)\n",
"y_pred_rf = rnd_clf.predict(X_test)"
],
"metadata": {
"id": "rBo1gB1MVWwI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"y_pred = rnd_clf.predict(X_test)\n",
"accuracy_score(y_test, y_pred)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "X5-UVNb5jk-H",
"outputId": "63b8fb73-d95f-47f4-dcb5-fe0150c9e8d9"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.912"
]
},
"metadata": {},
"execution_count": 31
}
]
},
{
"cell_type": "markdown",
"source": [
"You can also create an Extra-Trees classifier using `Scikit-Learn`’s `ExtraTreesClassifier` class. Its API is identical to the `RandomForestClassifier` class, except `bootstrap` defaults to `False`. Similarly, the `ExtraTreesRegressor` class has the same API as the `RandomForestRegressor` class, except `bootstrap` defaults to `False`."
],
"metadata": {
"id": "9x22GreAfOSv"
}
},
{
"cell_type": "markdown",
"source": [
"## AdaBoost"
],
"metadata": {
"id": "Ro3Mdc0UjuAe"
}
},
{
"cell_type": "markdown",
"source": [
"Scikit-Learn uses a multiclass version of AdaBoost called `SAMME` (which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities (i.e., if they have a `predict_proba()` method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better."
],
"metadata": {
"id": "bS8rugGHkMov"
}
},
{
"cell_type": "markdown",
"source": [
"The following code trains an AdaBoost classifier based on 30 Decision Stumps using `Scikit-Learn`’s `AdaBoostClassifier` class (as you might expect, there is also an `AdaBoostRegressor` class). A Decision Stump is a Decision Tree with `max_depth=1`—in other words, **a tree composed of a single decision node** plus two leaf nodes. This is the default base estimator for the `AdaBoostClassifier` class:"
],
"metadata": {
"id": "Pmgz2OtNkX-n"
}
},
{
"cell_type": "code",
"source": [
"ada_clf = AdaBoostClassifier(\n",
" DecisionTreeClassifier(max_depth=1), n_estimators=30,\n",
" learning_rate=0.5, random_state=42)\n",
"ada_clf.fit(X_train, y_train)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "aDhLJhmhdfw2",
"outputId": "7b365b79-8e1d-4ac6-ab4a-8dcd7433cf96"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),\n",
" learning_rate=0.5, n_estimators=30, random_state=42)"
]
},
"metadata": {},
"execution_count": 115
}
]
},
{
"cell_type": "code",
"source": [
"plot_decision_boundary(ada_clf, X_train, y_train)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 285
},
"id": "Gsh_380VkHX_",
"outputId": "20d453fc-efc5-4ea5-c391-af74e94502fd"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Gradient Boosting"
],
"metadata": {
"id": "91HFYHBB-Bmv"
}
},
{
"cell_type": "markdown",
"source": [
"First, let’s generate a noisy quadratic dataset and fit a `DecisionTreeRegressor` to it:"
],
"metadata": {
"id": "2uM9bAyb-D1m"
}
},
{
"cell_type": "code",
"source": [
"np.random.seed(42)\n",
"X = np.random.rand(100, 1) - 0.5\n",
"y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100) # y = 3x² + Gaussian noise\n",
"\n",
"tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)\n",
"tree_reg1.fit(X, y)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LgoG2Cg58-oC",
"outputId": "a702998b-7ece-466f-a438-3d2ada3b4f24"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"DecisionTreeRegressor(max_depth=2, random_state=42)"
]
},
"metadata": {},
"execution_count": 119
}
]
},
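{
"cell_type": "markdown",
"source": [
"Gradient boosting works by fitting each new predictor to the residual errors left by the predictors before it. The following sketch illustrates that idea by hand with three trees. It regenerates the same noisy quadratic data so that it is self-contained, and the names `trees`, `residual` and `X_new` are illustrative:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"np.random.seed(42)\n",
"X = np.random.rand(100, 1) - 0.5\n",
"y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)\n",
"\n",
"trees = []\n",
"residual = y.copy()\n",
"for _ in range(3):\n",
"    tree = DecisionTreeRegressor(max_depth=2, random_state=42)\n",
"    tree.fit(X, residual)\n",
"    # The next tree is trained on what the ensemble still gets wrong\n",
"    residual = residual - tree.predict(X)\n",
"    trees.append(tree)\n",
"\n",
"# The ensemble prediction is the sum of the individual trees' predictions\n",
"X_new = np.array([[-0.4], [0.0], [0.5]])\n",
"y_pred = sum(tree.predict(X_new) for tree in trees)\n",
"print(y_pred)\n",
"```\n",
"\n",
"Each additional tree can only lower (or leave unchanged) the mean squared error on the training set, which is why the number of trees needs to be controlled to avoid overfitting."
],
"metadata": {}
},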
{
"cell_type": "markdown",
"source": [
"A simpler way to train GBRT ensembles is to use Scikit-Learn’s `GradientBoostingRegressor` class (there’s also a `GradientBoostingClassifier` class for classification). Much like the `RandomForestRegressor` class, it has hyperparameters to control the growth of Decision Trees (e.g., `max_depth`, `min_samples_leaf`), as well as hyperparameters to control the ensemble training, such as the number of trees (`n_estimators`). "
],
"metadata": {
"id": "PXBmpTZ6-TUf"
}
},
{
"cell_type": "code",
"source": [
"gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)\n",
"gbrt.fit(X, y)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QUJv2gs2-pmE",
"outputId": "016aca12-5341-403d-ede4-03d9ddf357ad"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3,\n",
" random_state=42)"
]
},
"metadata": {},
"execution_count": 122
}
]
},
{
"cell_type": "markdown",
"source": [
"The `learning_rate` hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage.\n",
"\n",
"To find the optimal number of trees, you could perform cross-validation using `GridSearchCV` or `RandomizedSearchCV`, as usual, but there’s a simpler way: if you set the `n_iter_no_change` hyperparameter to an integer value, say 10, then the `GradientBoostingRegressor` will automatically stop adding more trees during training if it sees that the last 10 trees didn’t help. This is simply early stopping, but with a little bit of patience: it tolerates having no progress for a few iterations before it stops. Let’s train the ensemble using early stopping:"
],
"metadata": {
"id": "HONBXa5P--iv"
}
},
{
"cell_type": "code",
"source": [
"gbrt_best = GradientBoostingRegressor(\n",
" max_depth=2, learning_rate=0.05, n_estimators=500,\n",
" n_iter_no_change=10, random_state=42)\n",
"gbrt_best.fit(X, y)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hSZK_Cvb-zke",
"outputId": "c4b43dca-b453-4619-c547-b0bb44bb4c50"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"GradientBoostingRegressor(learning_rate=0.05, max_depth=2, n_estimators=500,\n",
" n_iter_no_change=10, random_state=42)"
]
},
"metadata": {},
"execution_count": 123
}
]
},
{
"cell_type": "markdown",
"source": [
"If you set `n_iter_no_change` too low, training may stop too early and the model will underfit. But if you set it too high, it will overfit instead. We also set a fairly small learning rate and a high number of estimators, but the actual number of estimators in the trained ensemble is much lower, thanks to early stopping:"
],
"metadata": {
"id": "-JbdlvKh_hwO"
}
},
{
"cell_type": "code",
"source": [
"gbrt_best.n_estimators_"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8zloEgk6_UJj",
"outputId": "77810466-8ed3-4d26-848c-2faea98bc8b4"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"92"
]
},
"metadata": {},
"execution_count": 124
}
]
},
{
"cell_type": "markdown",
"source": [
"When `n_iter_no_change` is set, the `fit()` method automatically splits the training set into a smaller training set and a validation set: this allows it to evaluate the model’s performance each time it adds a new tree. The size of the validation set is controlled by the `validation_fraction` hyperparameter, which is 10% by default. The `tol` hyperparameter determines the maximum performance improvement that still counts as negligible. It defaults to 0.0001."
],
"metadata": {
"id": "L-vULKsT_mEO"
}
},
{
"cell_type": "markdown",
"source": [
"The `GradientBoostingRegressor` class also supports a `subsample` hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if `subsample=0.25`, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this technique trades a higher bias for a lower variance. It also speeds up training considerably. This is called **Stochastic Gradient Boosting**."
],
"metadata": {
"id": "E8JMjIRF_xdd"
}
},
{
"cell_type": "markdown",
"source": [
"## XGBoost"
],
"metadata": {
"id": "KHUBGzQqZRhM"
}
},
{
"cell_type": "markdown",
"source": [
"### Classification task"
],
"metadata": {
"id": "HIsGDU0-bB-a"
}
},
{
"cell_type": "markdown",
"source": [
"Here are the essential steps to build an XGBoost classification model with the scikit-learn API, evaluated on a held-out test set."
],
"metadata": {
"id": "l41vqfEkbKcK"
}
},
{
"cell_type": "code",
"source": [
"iris = datasets.load_iris()\n",
"df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],columns= iris['feature_names'] + ['target'])\n",
"df.head()"
],
"metadata": {
"id": "ZVg6wzfA_Vyq",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"outputId": "6a21e676-4e2b-44e6-f754-6b1c55ed0d58"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target \n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 "
]
},
"metadata": {},
"execution_count": 38
}
]
},
{
"cell_type": "code",
"source": [
"X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=42)"
],
"metadata": {
"id": "G3R0nCm-Z4ly"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The following template is for building an XGBoost classifier:"
],
"metadata": {
"id": "Q6wn7paXZpj-"
}
},
{
"cell_type": "code",
"source": [
"xgb = XGBClassifier(booster='gbtree', objective='multi:softprob', \n",
" learning_rate=0.1, n_estimators=100, random_state=42, n_jobs=-1)"
],
"metadata": {
"id": "rsjJb0POZ8Y7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"1. `booster='gbtree'`: The booster is the base learner. It's the machine learning model that is constructed during every round of boosting. You may have guessed that 'gbtree' stands for gradient boosted tree, the XGBoost default base learner. It's uncommon but possible to work with other base learners.\n",
"\n",
"2. `objective='multi:softprob'`: Standard options for the objective can be viewed in the XGBoost official documentation, https://xgboost.readthedocs.io/en/latest/parameter.html, under Learning Task Parameters. The `multi:softprob` objective is a standard alternative to `binary:logistic` when the dataset includes multiple classes. It computes the probabilities of classification and chooses the highest one. If not explicitly stated, XGBoost will often find the right objective for you.\n",
"\n",
"3. `max_depth=6`: The `max_depth` of a tree limits how many levels of splits each tree may have. It's one of the most important hyperparameters in making balanced predictions. It is not set in the template above, so XGBoost uses its default of 6, unlike random forests, which leave the depth unlimited unless you set it explicitly.\n",
"\n",
"4. `learning_rate=0.1`: Within XGBoost, this hyperparameter is often referred to as `eta`. It limits the variance by shrinking the contribution of each tree by the given factor.\n",
"\n",
"5. `n_estimators=100`: Popular among ensemble methods, `n_estimators` is the number of boosted trees in the model. Increasing this number while decreasing `learning_rate` can lead to more robust results."
],
"metadata": {
"id": "rDCT_1p9aNBi"
}
},
{
"cell_type": "code",
"source": [
"xgb.fit(X_train, y_train)\n",
"y_pred = xgb.predict(X_test)\n",
"score = accuracy_score(y_test, y_pred)\n",
"print('Score: ' + str(score))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QYLOzVdMazA5",
"outputId": "e270ed09-e3f4-4ade-d8df-cc3f1fd6433d"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Score: 1.0\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"xgb.get_params()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Q8LM-RkUQLDd",
"outputId": "cd49710a-45fa-4a2e-ce82-860b66c41fa0"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'base_score': 0.5,\n",
" 'booster': 'gbtree',\n",
" 'callbacks': None,\n",
" 'colsample_bylevel': 1,\n",
" 'colsample_bynode': 1,\n",
" 'colsample_bytree': 1,\n",
" 'early_stopping_rounds': None,\n",
" 'enable_categorical': False,\n",
" 'eval_metric': None,\n",
" 'gamma': 0,\n",
" 'gpu_id': -1,\n",
" 'grow_policy': 'depthwise',\n",
" 'importance_type': None,\n",
" 'interaction_constraints': '',\n",
" 'learning_rate': 0.1,\n",
" 'max_bin': 256,\n",
" 'max_cat_to_onehot': 4,\n",
" 'max_delta_step': 0,\n",
" 'max_depth': 6,\n",
" 'max_leaves': 0,\n",
" 'min_child_weight': 1,\n",
" 'missing': nan,\n",
" 'monotone_constraints': '()',\n",
" 'n_estimators': 100,\n",
" 'n_jobs': -1,\n",
" 'num_parallel_tree': 1,\n",
" 'objective': 'multi:softprob',\n",
" 'predictor': 'auto',\n",
" 'random_state': 42,\n",
" 'reg_alpha': 0,\n",
" 'reg_lambda': 1,\n",
" 'sampling_method': 'uniform',\n",
" 'scale_pos_weight': None,\n",
" 'subsample': 1,\n",
" 'tree_method': 'exact',\n",
" 'use_label_encoder': False,\n",
" 'validate_parameters': 1,\n",
" 'verbosity': None}"
]
},
"metadata": {},
"execution_count": 43
}
]
},
{
"cell_type": "markdown",
"source": [
"### Regression task"
],
"metadata": {
"id": "e4LM00_-bExu"
}
},
{
"cell_type": "markdown",
"source": [
"Here are the essential steps to build an XGBoost regression model with the scikit-learn API and evaluate it using cross-validation."
],
"metadata": {
"id": "7uFOM-SqbIZ6"
}
},
{
"cell_type": "code",
"source": [
"X,y = datasets.load_diabetes(return_X_y=True)\n",
"\n",
"xgb = XGBRegressor(booster='gbtree', objective='reg:squarederror', \n",
" learning_rate=0.1, n_estimators=100, random_state=42, n_jobs=-1)\n",
"\n",
"scores = cross_val_score(xgb, X, y, scoring='neg_mean_squared_error', cv=5)\n",
"\n",
"# Negate the scores (negative MSE) and take the square root to get the RMSE\n",
"rmse = np.sqrt(-scores)\n",
"\n",
"# Display the RMSE of each fold\n",
"print('RMSE:', np.round(rmse, 3))\n",
"\n",
"# Display mean score\n",
"print('RMSE mean: %0.3f' % (rmse.mean()))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "q5w9Jclca8Gs",
"outputId": "7d67260e-3a04-4ba2-84f6-b95dc8227cf1"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"RMSE: [63.033 59.689 64.538 63.699 64.661]\n",
"RMSE mean: 63.124\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"xgb.fit(X, y)\n",
"xgb.get_params()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CbihBstGRQom",
"outputId": "ce8ceff2-0207-4fac-e3c4-050b039028ad"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'base_score': 0.5,\n",
" 'booster': 'gbtree',\n",
" 'callbacks': None,\n",
" 'colsample_bylevel': 1,\n",
" 'colsample_bynode': 1,\n",
" 'colsample_bytree': 1,\n",
" 'early_stopping_rounds': None,\n",
" 'enable_categorical': False,\n",
" 'eval_metric': None,\n",
" 'gamma': 0,\n",
" 'gpu_id': -1,\n",
" 'grow_policy': 'depthwise',\n",
" 'importance_type': None,\n",
" 'interaction_constraints': '',\n",
" 'learning_rate': 0.1,\n",
" 'max_bin': 256,\n",
" 'max_cat_to_onehot': 4,\n",
" 'max_delta_step': 0,\n",
" 'max_depth': 6,\n",
" 'max_leaves': 0,\n",
" 'min_child_weight': 1,\n",
" 'missing': nan,\n",
" 'monotone_constraints': '()',\n",
" 'n_estimators': 100,\n",
" 'n_jobs': -1,\n",
" 'num_parallel_tree': 1,\n",
" 'objective': 'reg:squarederror',\n",
" 'predictor': 'auto',\n",
" 'random_state': 42,\n",
" 'reg_alpha': 0,\n",
" 'reg_lambda': 1,\n",
" 'sampling_method': 'uniform',\n",
" 'scale_pos_weight': 1,\n",
" 'subsample': 1,\n",
" 'tree_method': 'exact',\n",
" 'validate_parameters': 1,\n",
" 'verbosity': None}"
]
},
"metadata": {},
"execution_count": 50
}
]
},
{
"cell_type": "markdown",
"source": [
"Without a baseline for comparison, we have no idea what that score means. Converting the target column, `y`, into a pandas DataFrame and calling the `.describe()` method gives the quartiles and general statistics of the target column, as follows:"
],
"metadata": {
"id": "HAY1SjZxboMz"
}
},
{
"cell_type": "code",
"source": [
"pd.DataFrame(y).describe()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 300
},
"id": "F35eRsafbXnV",
"outputId": "93fd4504-2aba-439f-e9b1-d85ed4f768fb"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" 0\n",
"count 442.000000\n",
"mean 152.133484\n",
"std 77.093005\n",
"min 25.000000\n",
"25% 87.000000\n",
"50% 140.500000\n",
"75% 211.500000\n",
"max 346.000000"
],
"text/html": [
"\n",
"